In this brief, we explore the Red Wine Quality dataset from P. Cortez et al., available through UCI’s ML repository, in an effort to see which properties are relevant to wine quality.
The dataset collects observations on a variety of chemical features of Vinho Verde red wines, along with the median rating of those wines by at least three experts. We observe moderate correlations between quality and two variables – alcohol level and volatile acidity – and weaker correlations with two other variables, sulphates and citric acid.
There are apparent non-linear relationships between quality and the two weaker variables. For example, sulphate concentrations are positively correlated with quality, except at high concentrations, where the correlation becomes negative. This may be an edge effect, or it may signal that too much or too little of any chemical feature is a bad thing in wine.
Interestingly, some of the features that served as good predictors of quality in Cortez et al.’s model do not stand out in this analysis, and the relative importance of the other predictors differs slightly here.
Let’s take a look at the shape of the dataset:
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
We have a complete dataset with 1,599 observations of 11 independent variables and 1 response variable, quality. Quality is integer-valued – a rating on a scale from 0 to 10 – but all the other variables are continuous-valued. Some of the variables, such as pH, are fairly familiar, but others, such as volatile acidity, are less so.
Next, we’ll take a look at summary statistics:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The response variable, quality, has a minimum of 3 and a maximum of 8. In other words, the dataset doesn’t contain any extremely low-quality wines, nor does it contain any extremely high-quality wines. Most wines received either a 5 or a 6 rating. This narrow IQR suggests that it may prove difficult to tease out what makes for a good red wine.
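The quartiles and IQR in a summary like the one above come from linear interpolation between order statistics. The sketch below is in Python rather than the R that evidently produced the output in this brief, and it uses only the ten quality scores visible in the dataset preview (an assumption made for illustration), so its numbers show the calculation and will not match the full-sample summary.

```python
from statistics import mean, quantiles

# Quality scores of the first ten wines shown in the dataset preview.
# Assumption: ten rows only, so results differ from the full 1,599-row summary.
quality_sample = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

# method="inclusive" interpolates between order statistics, the same rule
# as R's default (type 7) quantiles.
q1, median, q3 = quantiles(quality_sample, n=4, method="inclusive")
iqr = q3 - q1

print(f"min={min(quality_sample)} Q1={q1} median={median} "
      f"mean={mean(quality_sample):.2f} Q3={q3} max={max(quality_sample)}")
print(f"IQR={iqr}")
```

Even in this tiny sample, the middle half of the scores spans less than one rating point, which is the same narrowness the full summary shows.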
With respect to the chemical properties, we observe some with a similarly narrow range (e.g. citric acid, density), yet there are others that show a broader range. Total sulfur dioxide, for example, ranges from 6 to 289, spanning more than an order of magnitude.
As a first pass, we’ll focus on the following variables: quality, alcohol, pH, residual sugar, volatile acidity, chlorides, and total sulfur dioxide.
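One way to make "narrow" and "broad" concrete is to express each variable's spread as log10(max / min). The Python sketch below (not part of the original analysis) does this for a handful of variables, using the minima and maxima from the summary statistics above; the helper name `orders_of_magnitude` is ours.

```python
import math

# (min, max) pairs copied from the summary statistics above.
ranges = {
    "citric.acid": (0.000, 1.000),
    "density": (0.9901, 1.0037),
    "pH": (2.740, 4.010),
    "alcohol": (8.40, 14.90),
    "residual.sugar": (0.900, 15.500),
    "chlorides": (0.01200, 0.61100),
    "total.sulfur.dioxide": (6.00, 289.00),
}

def orders_of_magnitude(lo, hi):
    """Return log10(hi / lo), or None when lo is not positive and the ratio is undefined."""
    if lo <= 0:
        return None
    return math.log10(hi / lo)

spans = {name: orders_of_magnitude(lo, hi) for name, (lo, hi) in ranges.items()}

for name, span in spans.items():
    label = "undefined (min is 0)" if span is None else f"{span:.2f}"
    print(f"{name:22s} {label}")
```

Citric acid has a minimum of zero, so a ratio-based measure does not apply to it; for such variables the absolute range (here 0 to 1 g/dm^3) is the more useful description.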
To be sure, there are other variables of interest. In a later section, we’ll consider through more comprehensive and efficient means whether to pay attention to these.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
As we saw above, quality is integer-valued. The distribution is Gaussian-looking, with a slightly negative skew and short tails. Most wines are apparently so-so. Some are quite good, but not many; even fewer are quite bad. There are no wines with a score greater than 8 or less than 3, so there are no truly exceptional or truly terrible wines in our sample.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
It seems that most wines are in the 9%-12% alcohol by volume (abv) range, with a spike around 9% abv. All values of alcohol are in the 8.4%-14.9% range. The histogram is quite skewed and has one clear outlier at 14.9% abv.
In the greater context of wine, this plot is mildly surprising. Most red wines, according to this infographic, fall in the 12%-16% abv range, whereas the wines in this dataset are mostly in the 9%-12% abv range. It turns out, though, that such a range is common for Vinho Verde, so our data seem trustworthy, and since 14.9% is still a reasonable abv for red wine in general, we won’t exclude the outlier.
Nevertheless, we should proceed with caution if we try to generalize any findings about Vinho Verde wine to wines in general.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH values follow a pretty solid normal distribution, with a median nearly equal to the mean.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
For residual sugar, we see a much more irregular distribution with a very long tail. One wonders how the judges will rate wines for which the residual sugar level resides in this tail.
We can see whether a log transformation of this variable yields a distribution that’s more normal in appearance:
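As a rough numerical companion to that visual check, the Python sketch below (not the original analysis code) compares the skewness of residual sugar before and after a log10 transformation. It assumes only the ten residual sugar values visible in the dataset preview, so it illustrates the technique and says nothing definitive about the full sample.

```python
import math

# Residual sugar (g/dm^3) for the first ten wines in the dataset preview.
# Assumption: a ten-row sample, used only to illustrate the calculation.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]

def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

log_sugar = [math.log10(v) for v in residual_sugar]
raw_skew = skewness(residual_sugar)
log_skew = skewness(log_sugar)

print(f"skewness before: {raw_skew:.2f}, after log10: {log_skew:.2f}")
```

A skewness closer to zero after the transformation means the long right tail has been pulled in, though a single large value can still leave the transformed sample visibly asymmetric.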
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The volatile acidity distribution is fairly normal, with some noise near the mean of the distribution. There is, again, a tail on the positive end of the distribution. Some of these values fall outside the U.S. legal limit for volatile acidity, which is 1.2 g/dm^3, but not so far beyond that limit as to deserve exclusion from the dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
For chlorides, the tail is very long, and it doesn’t disappear upon a log transformation.
For all of these variables whose distributions have long tails, are these “extremes” intentional? Are some wines especially salty or sweet on purpose, and will the judges appreciate these choices? Or are these extremes a sign of defects in the wine?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
For total sulfur dioxide, we have a distribution that appears log-normal. There are a couple of outliers in the positive tail, but they are within the U.S. legal limit of 350 mg/dm^3.
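The shape descriptions above can be cross-checked against the printed summaries without replotting. The Python sketch below (an illustration, not the original analysis code) scales the gap between mean and median by the IQR: values near zero suggest symmetry, and larger positive values suggest a long right tail. The helper name `mean_median_gap` is ours, and the heuristic is a rough guide rather than a formal test.

```python
# (first quartile, median, mean, third quartile) copied from the summaries above.
summaries = {
    "quality": (5.000, 6.000, 5.636, 6.000),
    "alcohol": (9.50, 10.20, 10.42, 11.10),
    "pH": (3.210, 3.310, 3.311, 3.400),
    "residual.sugar": (1.900, 2.200, 2.539, 2.600),
    "volatile.acidity": (0.3900, 0.5200, 0.5278, 0.6400),
    "chlorides": (0.07000, 0.07900, 0.08747, 0.09000),
    "total.sulfur.dioxide": (22.00, 38.00, 46.47, 62.00),
}

def mean_median_gap(q1, median, mean, q3):
    """(mean - median) / IQR: near 0 for symmetric data, positive for a long right tail."""
    return (mean - median) / (q3 - q1)

gaps = {name: mean_median_gap(*stats) for name, stats in summaries.items()}

# Print from most left-leaning to most right-leaning.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:22s} {gap:+.3f}")
```

On these figures pH sits closest to zero, while residual sugar and chlorides show the largest positive gaps, in line with the long tails noted above.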
We now turn our attention to how chemical features relate to quality.
We should also investigate whether any of the variables we haven’t explored merit a further look. One tacit assumption thus far has been that different “acid”-like variables were similar enough that choosing one representative variable was sufficient, and similarly for “sulfur”-like variables. Let’s check those assumptions.
A more efficient way to determine which variables are correlated with each other and, perhaps more importantly, with our response variable, is to look at a scatterplot matrix. We’ll use the psych package to do so. We’ll also take that as a jumping-off point into other bivariate analyses.
We learn many interesting things from this matrix. First, we get Pearson correlations amongst the “acid”-like variables and the “sulfur”-like variables.
Fixed.acidity and citric.acid are fairly well-correlated (r = 0.67), while volatile.acidity and citric.acid are fairly anti-correlated (r = -0.55). And fixed.acidity and volatile.acidity are weakly anti-correlated (r = -0.26).
With respect to “sulfur”-like variables, total.sulfur.dioxide and free.sulfur.dioxide are also fairly well-correlated (r = 0.67), but neither is correlated with sulphates.
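For readers who want to see what sits behind each r in the matrix, here is a minimal Python sketch of the Pearson correlation (not the original analysis code). It assumes only the first ten rows of fixed.acidity and citric.acid from the dataset preview, so the value it prints will differ from the full-sample r reported above.

```python
import math

# First ten rows of two columns from the dataset preview.
# Assumption: a ten-row sample, so r will not equal the full-sample value.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
citric_acid = [0.0, 0.0, 0.04, 0.56, 0.0, 0.0, 0.06, 0.0, 0.02, 0.36]

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation divided by the root of the product of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson_r(fixed_acidity, citric_acid)
print(f"r = {r:.2f} (ten-row sample)")
```

With so few rows, a single wine with high values on both variables dominates the result, which is a useful reminder of why the full 1,599-row matrix is the one to trust.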
More interesting, of course, is how all the variables relate to quality. We can classify variables as “moderate” or “weak” according to their r value.
Moderate correlations (0.3 < |r| < 0.5): alcohol and volatile acidity
Weak correlations (0.2 < |r| < 0.3): sulphates and citric acid
Unless we find a reason to consider other variables later, we’ll stick with these four as our variables of interest – dropping pH, residual sugar, and chlorides – as we continue our exploratory analysis.
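The banding rule is simple enough to write down explicitly. The Python sketch below (an illustration, not the original analysis code) applies the two bands to the absolute value of r, since the sign only indicates direction. The function name `classify` and the "outside both bands" label are ours; the example values are the inter-variable correlations quoted earlier in this section.

```python
def classify(r):
    """Label a Pearson r using the two bands stated above, applied to |r|.

    The bands are open intervals, so values exactly at 0.2, 0.3, or 0.5
    fall outside both.
    """
    strength = abs(r)
    if 0.3 < strength < 0.5:
        return "moderate"
    if 0.2 < strength < 0.3:
        return "weak"
    return "outside both bands"

# Correlations quoted earlier in this section.
quoted = {
    ("fixed.acidity", "citric.acid"): 0.67,
    ("volatile.acidity", "citric.acid"): -0.55,
    ("fixed.acidity", "volatile.acidity"): -0.26,
    ("total.sulfur.dioxide", "free.sulfur.dioxide"): 0.67,
}
labels = {pair: classify(r) for pair, r in quoted.items()}

for (a, b), label in labels.items():
    print(f"{a} vs {b}: {label}")
```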
Next, let’s look at scatterplots of quality and each of these predictors in turn.
The correlation noted in the matrix is evident in the scatterplot too. There is an interesting structure here: up to about alcohol = 10% abv, a clear majority of the wines have a quality score of less than 6, while beyond alcohol = 12% abv, a clear majority have a quality score greater than 6. Between these two values of alcohol, there is greater variability.
We also see the lone high-alcohol wine, with abv = 14.9%, dragging down the trend line. It would be interesting to collect more data on wines at this abv level and see whether quality does tend to drop.
We see evidence here of the negative trend already described. As volatile acidity increases, quality tends to decrease. Interestingly, though, for particularly low values of volatile acidity, the trend is reversed, and this variable becomes positively correlated with quality, suggesting that a little bit of acidity (or bitterness) is a good thing.
The weak linear trend indicated in the scatterplot matrix can now be recognized as a non-linear relationship. As the sulphates level increases from its minimum, the oxidation of the wine decreases, and quality improves. At a certain sulphate level, though, the trend reverses, and quality starts going down.
The non-linearity induced by high-sulphate values, which starts around sulphates = 0.9 g/dm^3, doesn’t appear to be an artifact of a small sample size. There are 110 wines with a sulphate value greater than or equal to 0.9, or roughly 7% of the dataset.
## Source: local data frame [1 x 1]
##
## n
## 1 110
This might mean the vintners intended to produce wines at this sulphate level but were unaware of the impact on quality.
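The share of high-sulphate wines follows directly from the count in the output above and the 1,599 observations in the dataset. The Python sketch below (not the original analysis code) shows the arithmetic, along with a small counting helper, `count_at_least`, which is our own name and is exercised only on the ten sulphate values visible in the dataset preview.

```python
TOTAL_WINES = 1599          # observations reported in the dataset structure
HIGH_SULPHATE_WINES = 110   # count shown in the output above

share = HIGH_SULPHATE_WINES / TOTAL_WINES
print(f"{share:.1%} of wines have sulphates >= 0.9 g/dm^3")

def count_at_least(values, threshold):
    """Count how many values are greater than or equal to the threshold."""
    return sum(1 for v in values if v >= threshold)

# Sulphates for the first ten wines in the preview (a ten-row sample only).
sulphates_sample = [0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8]
n_high_in_sample = count_at_least(sulphates_sample, 0.9)
print(f"high-sulphate wines among the first ten rows: {n_high_in_sample}")
```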
Overall, we see the basic weak, but positive trend that we expected. There are some unexpected “flatlines” in the curve here. Between citric acid values of 0.0 and 0.2, and again between 0.4 and 0.5, quality doesn’t seem to change much.
There are also some similarities to the quality vs. alcohol plot. There is a value of citric acid (~0.25 g/dm^3) below which high-quality wines are relatively uncommon. On the other hand, there is a value of citric acid (~0.5 g/dm^3) above which low-quality wines are relatively uncommon.
Taken together, these scatterplots suggest that a single chemical factor can keep a wine from being very good or very bad, but that no factor on its own can guarantee a high-quality or low-quality wine.
In the next section, we’ll explore what balance of factors is needed to produce good Vinho Verde.
We now bring a third dimension into our exploratory analysis. The continuous-valued nature of our independent variables will make a color-based encoding difficult to interpret, so let’s bucket each of them by quartile.
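Quartile bucketing amounts to comparing each value with three cut points. Here is a minimal Python sketch (not the original analysis code) that uses the sulphate quartiles from the summary statistics as cut points and assigns each of the ten sulphate values from the dataset preview to a bucket. The helper name `quartile_bucket` and the choice of right-closed intervals are our assumptions.

```python
from bisect import bisect_left

# Quartile cut points for sulphates, taken from the summary statistics
# (1st quartile, median, 3rd quartile).
SULPHATE_CUTS = [0.55, 0.62, 0.73]

def quartile_bucket(value, cuts):
    """Return 1-4 for the right-closed interval holding value.

    1: value <= q1, 2: q1 < value <= q2, 3: q2 < value <= q3, 4: value > q3.
    """
    return bisect_left(cuts, value) + 1

# Sulphates for the first ten wines in the dataset preview.
sulphates_sample = [0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8]
buckets = [quartile_bucket(s, SULPHATE_CUTS) for s in sulphates_sample]

for value, bucket in zip(sulphates_sample, buckets):
    print(f"sulphates={value:.2f} -> Q{bucket}")
```

Four discrete buckets map cleanly onto four colors, which is what makes the encoded plots readable.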
Now let’s see whether these multivariate plots reveal any additional patterns.
As we might expect, the bucketed variables add extra explanatory power. At a given alcohol level, we see quality correlated with buckets in the ways we found earlier. For example, in the first plot, a lower volatile acidity is associated with a higher quality at every abv level.
Again, the weakly correlated variables add extra explanatory power to the two-dimensional plots of quality vs. volatile acidity.
At this point, it’s natural to simplify our plots even further through aggregation. Taking the “quality vs. alcohol & sulphates” plot, let’s see what happens to the quality-alcohol correlation when we average within sulphate quartiles.
Sulphates do clearly explain some of that variance. Most of the points above the smoothing curve are high-sulphate (but not too high!), and most of the points below are low-sulphate.
Let’s examine a similar plot, this time averaging within volatile acidity quartiles.
Again, a pattern emerges amidst the sea of dots. Most of the points above the smoothing curve have a low volatile acidity, confirming the trend we saw in our bivariate analysis.
A line graph, though noisy, makes the trend stand out better:
If we smooth out noise by rounding alcohol values, then the pattern is instantly comprehensible.
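The aggregation behind such a plot is a grouped mean: round alcohol to the nearest whole percent, bucket volatile acidity by quartile, and average quality within each (alcohol, bucket) cell. The Python sketch below (not the original analysis code) runs this on the first ten rows of the dataset preview only, so the cell means are illustrative. The quartile cut points come from the summary statistics; the half-up rounding rule and the helper name `round_half_up` are our assumptions.

```python
import math
from bisect import bisect_left
from collections import defaultdict

# First ten rows of the dataset preview: (alcohol, volatile acidity, quality).
rows = [
    (9.4, 0.70, 5), (9.8, 0.88, 5), (9.8, 0.76, 5), (9.8, 0.28, 6),
    (9.4, 0.70, 5), (9.4, 0.66, 5), (9.4, 0.60, 5), (10.0, 0.65, 7),
    (9.5, 0.58, 7), (10.5, 0.50, 5),
]

# Volatile acidity quartiles from the summary statistics.
VA_CUTS = [0.39, 0.52, 0.64]

def round_half_up(x):
    """Round to the nearest integer, with .5 always rounding up (assumed rule)."""
    return math.floor(x + 0.5)

# Collect quality scores per (rounded alcohol, volatile acidity quartile) cell.
groups = defaultdict(list)
for alcohol, va, quality in rows:
    key = (round_half_up(alcohol), bisect_left(VA_CUTS, va) + 1)
    groups[key].append(quality)

mean_quality = {key: sum(q) / len(q) for key, q in sorted(groups.items())}

for (abv, va_quartile), avg in mean_quality.items():
    print(f"alcohol~{abv}% abv, volatile acidity Q{va_quartile}: mean quality {avg:.2f}")
```

Plotting one line per quartile through these cell means gives the smoothed view described above; with the full dataset each cell would hold many more wines than it does in this ten-row sample.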
On average, then, as alcohol goes up, and as volatile acidity goes down, quality increases (modulo edge effects).
Let’s see how sulphates behave in a plot of this type:
Now we can see that a wine with a very high sulphate concentration (sulphates > 0.9 g/dm^3) is indeed of lower quality than moderately high-sulphate wine (0.73 < sulphates <= 0.9) across all alcohol levels, albeit not by much. In other words, too low a sulphate concentration may be a bad thing for freshness, but too high a concentration may have an adverse impact on quality too.
One of the main challenges with this dataset is the narrow range of quality ratings available and the discrete values that these quality scores take on. It would be interesting to analyze more extreme ratings (e.g. wines with ratings of 1, 2, 9, or 10) as well as more nuanced ratings (perhaps by taking the mean of the experts’ ratings rather than the median).
Nevertheless, after exploring the wines in this dataset, we were able to find two variables, alcohol content and volatile acidity, that correlate moderately well with quality, and two variables, sulphate concentration and citric acid, that may explain additional variance in quality. In several cases, we saw relationships departing from linearity – sometimes due to edge effects, but sometimes not. Presumably, each chemical feature has a “sweet spot” above or below which quality suffers. It would be interesting to do similar exploratory analyses with other popular types of red wines and see how they compare.
Finally, we note that Cortez et al. built a predictive model of quality that suggests relationships not observed here. The highest-weighted predictors in their SVM were sulphates, pH, and total sulfur dioxide, while the lowest-weighted predictor was citric acid. Investigating the nature of these discrepancies is a subject for future exploration.